Vision transformer

A vision transformer (ViT) is a transformer designed for computer vision.[1] Where a text transformer splits its input into tokens, a ViT splits an input image into a sequence of fixed-size patches, serialises each patch into a vector, and projects that vector to the model's embedding dimension with a single matrix multiplication. The resulting embeddings are then processed by a transformer encoder as if they were token embeddings.
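The patch-embedding step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation: the function names `patchify` and `embed_patches`, the 224×224 RGB input, the embedding dimension of 192, and the random projection matrix are all assumptions chosen for the example. In a trained model the projection matrix is learned, and position embeddings are typically added before the encoder.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def embed_patches(patches, projection):
    """Map each flattened patch to the embedding dimension with one matmul."""
    return patches @ projection


# Illustrative values only (assumptions, not taken from the article).
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))             # stand-in for a real image
patch_size = 16
embed_dim = 192                               # hypothetical embedding width

patches = patchify(image, patch_size)         # one row per 16x16x3 patch
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = embed_patches(patches, projection)   # sequence fed to the encoder

print(patches.shape, tokens.shape)
```

Each row of `tokens` plays the role that a token embedding plays in a text transformer, so the encoder itself needs no image-specific changes.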

ViT has found applications in image recognition, image segmentation, and autonomous driving.[2]

  1. ^ Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  2. ^ Sarkar, Arjun (2021-05-20). "Are Transformers better than CNN's at Image Recognition?". Medium. Retrieved 2021-07-11.
